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Abstract 

Background: Genome-wide association studies have identified many individual genes associated with complex 
traits. However, pathway and network information have not been fully exploited in searches for genetic 
determinants, and including this information may increase our understanding of the underlying biology of 
common diseases. 

Results: In this study, we propose a framework to address this problem in a principled way, with the underlying 
hypothesis that complex disease operates through multiple connected genes. Associations inferred from GWAS are 
translated into prior scores for vertices in a protein-protein interaction network, and these scores are propagated 
through the network. Permutation is used to select genes that are guilty-by-association and thus consistently 
obtain high scores after network propagation. We apply the approach to data of Crohn's disease and call 
candidate genes that have been reported by other independent GWAS, but not in the analysed data set. A 
prediction model based on these candidate genes show good predictive power as measured by Area Under the 
Receiver Operating Curve (AUC) in 10 fold cross-validations. 

Conclusions: Our network propagation method applied to a genome-wide association study increases association 
findings over other approaches. 




Background 

In recent years, genome-wide association studies 
(GWAS) have become a common tool to discover the 
genetic basis of complex diseases and have led to many 
scientific discoveries [1]. Many single nucleotide poly- 
morphisms (SNPs) have been identified in a variety of 
diseases. The single marker analysis tests genetic associa- 
tion of individual SNPs and identifies only the most signif- 
icant SNPs below a stringent significance level, for 
example, p <5 x 10 r 8 , which is necessary to control the 
false positive rate on a genome-wide level. However, 
the identified SNPs only represent a small fraction of the 
genetic variants to contribute to complex diseases, due to 
small individual effect sizes. Markers that are truly but 
weakly associated with disease often fail to be detected [2]. 

It is well understood that the stability of biological 
systems is governed by many biomolecular interactions 
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and multi-gene effects should be taken into considera- 
tion while mapping from genotypes to phenotypes. 
Consider a crucial biological mechanism, where failure 
of a small portion of the important genes can lead to 
dys-function of the whole biological mechanism. This is 
very likely to happen in case of complex disease, such as 
Crohn's Disease, and therefore multi-locus analysis show 
increase of power when analyzing such data. 

Pathway-based or gene set enrichment analysis has 
become a potentially powerful approach in search of dis- 
ease associated genes (for a recent review of pathway 
analysis, see [3]). One of the most popular methods is 
GSEA [4]. Using a modified Kolmogorov-Smirnov test, it 
compares the p value distribution of genes in a pathway 
with the rest of the genes. GSEA has successfully identi- 
fied the IL-12/IL-23 pathway that is significantly asso- 
ciated to Crohn's disease [5]. However, there are some 
disadvantages to the common pathway based analysis. 
First, most of the studies choose the most significant 
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SNP from each gene as a representative, and therefore 
systematic but small changes in a gene-set will be missed 
if individual genes do not have any SNPs with strong 
marginal association. Second, how robust the methods 
are with regard to factors such as pathway annotations 
and pathway size is not clear. Third, many methods treat 
all genes equally despite that some genes (e.g., house- 
keeping genes) appear in many pathways [6]. The addi- 
tional information of genes in overlapping pathways 
should not be ignored, as shown in a study where weight- 
ing genes based on their appearance in the gene sets can 
improve gene set ranking and boost sensitivity of the ana- 
lysis [7]. 

Similar to pathway-based analysis, where biologically 
relevant connections from public databases are utilised in 
GWAS, network-based analysis has also become a popular 
tool for the study of complex disease. In context of mole- 
cular interaction networks, it has been found that about 
one third of known disorders with multiple genes show 
physical clustering of genes with the same phenotype and 
these clusters are likely to represent disorder-specific func- 
tional modules [8]. A concept of disease modules was 
emerged as more studies show that proteins that are 
involved in the same disease show a high propensity to 
interact with each other [9,10]. If a few disease compo- 
nents are identified, other disease-related components are 
likely to be found in their network-based vicinity. For 
example, various module search methods have been devel- 
oped in search of disease associated modules [11,12]. 
However, a module consists of an arbitrary number of 
genes, it often requires intensive simulations for multiple- 
testing corrections. 

Network Guilt by Association (GBA) is an approach 
for identifying disease genes based on the observation 
that similar phenotypes arise from functionally related 
genes. Algorithms related to Google's PageRank, such as 
Iterative Ranking and Gaussian smoothing, are applied 
in prioritizing candidate disease genes using network 
information [13]. A typical workflow looks like this: 
given a query disease, known causal genes of diseases 
that are phenotypically similar to the query disease are 
given a prior score in human PPI network, then the 
prior scores are propagated and smoothed over network 
such that each protein gets an association score. Genes 
with high association scores are considered as candidate 
associations. 

In this study, we analyzed a GWAS in a GBA frame- 
work. The network is overlaid with GWAS information, 
that is, each gene is assigned a prior score based on the 
gene level p value from GWAS, which represents our 
prior knowledge of its association to the disease. After 
propagation, the prior score has been smoothed over the 
whole network and each gene gets a new association 
score, denoted as the posterior score, with higher posterior 



score representing stronger evidence for association. If a 
gene has many neighbours that are associated with the dis- 
ease, it is very likely that itself is also associated. A gene 
with high posterior score can be called as candidate genes, 
even though it has a low prior score and fails to be called 
in standard GWAS due to stringent p value cut-off, chip 
coverage or sampling bias. 

A recent study applied similar ideas to prioritize candi- 
date disease genes and demonstrated a boost in the power 
to detect associated genes in GWAS [14]. Using a naive 
Bayes framework on datasets of Crohn's disease and Type 
1 diabetes, the posterior score of each gene was obtained 
by adding its own log odds from GWAS (as prior) and a 
soft GBA score from neighbors. The study showed that 
some genes with high posterior scores are actually vali- 
dated as true associations in later studies, although they 
do not have highest prior scores (e.g., lowest p values in 
GWAS studies) and would possibly be ignored in two- 
stage studies. However, there are some open questions in 
this study. First, the posterior score of a gene depends not 
only on its neighbors association of disease, but also on 
how many neighbors it has, i.e, network topology. It is 
appealing to mark a gene with high posterior score as 
associated, neglecting the fact that a high posterior score 
is merely due to a high degree (receiving information from 
more neighbors). Second, there is no statistic control (e.g., 
false discovery rate) for the findings. If there is no signal in 
the network, i.e, a completely random prior for all the 
genes, this method still outputs the genes with highest 
posterior scores. 

Methods 

In this section, we first describe the data we used and 
the network propagation framework, then we build a 
prediction model based on the GBA genes, and evaluate 
the performance in cross-validation (CV). 

Data Set 

Prior information from GWAS. We analyzed the raw 
anonymous genotype data of the Wellcome Trust Case 
Control Consortium (WTCCC) study. The original cohort 
includes 2005 Caucasian UK patients of Crohn's disease 
and 3004 controls genotyped on the Affymetrix 500K map- 
ping array. The details are described in [15]. Genotypes 
with posterior probability (or CHIAMO score) lower than 
0.98 are considered as missing data. Markers are removed 
if the percentage of missing data was larger than 5% or if 
they are not in Hardy- Weinberg equilibrium (p > 0.0001 
for control group). We further remove some individuals 
with missing allele larger than 3% or of non-European 
ancestry or with duplicated samples, as suggested by [15], 
and are left with 1748 cases and 2938 controls. Finally we 
map a SNP to a gene if it was located within the gene or 
lOkb immediately upstream or downstream. 
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Interaction network. The PPI network is built based 
on the STRING database version 9.0 [16]. Only interac- 
tions with a score larger than 700 are included, and it 
results in 229599 interactions involving 15010 proteins. 
We use proteins and genes interchangeably in the follow- 
ing, because SNPs were first mapped to genes then 
mapped to corresponding proteins. The GWAS dataset 
only covers part of the genes in the network, for example, 
11363 genes for Crohn's disease. We discard all the ver- 
tices that are not covered by GWAS, keeping only edges 
between covered vertices. Isolated nodes are also removed, 
in the end, we are left with a large connected network N . 

Propagation of evidence 

Consider a PPI network as an undirected graph 
G = (V, E, w), where nodes V are a set of proteins, 
edges E are links between proteins if interaction exists, 
W is the weight of an edge. For a node v e V, denote 
its total number of neighbors by degree(v) and its 
direct neighbors in G by N (v). Let Y : V — » R> 0 repre- 
sent a function of prior evidence, i.e., assign high score 
to a node v if we a priori (from GWAS) believe that it 
is associated to the disease. F : V -> R> 0 denotes a 
function of posterior evidence, i.e., F (v) represents the 
posterior evidence of association after propagating the 
information of its neighbor nodes. The main three 
steps of network propagating are, (1) obtaining prior 
information from GWAS, (2) calculating and normalis- 
ing posterior scores by network propagation (with 
choice of tuning parameters), (3) selecting genes with 
highest posterior score as candidates. 

Gene-level prior scores. The p value of a gene is 
defined as the minimum single marker test p value of 
its SNPs, as widely used in pathway analysis. The prior 
score of gene i defined as j, = <I> _1 (1 - p/2) and ® is 
the Cumulative Distribution Function of normal distri- 
bution. Therefore, under the null hypothesis of no asso- 
ciation, Yi ~ Af + (0, 1). According to [17], minimum 
p values performs best in most scenarios, we also tried 
Fisher's combined probability test, however, it gives 
lower internal consistency of gene ranks for random 
subset of the data. 

Calculating posterior scores by propagation. The 
posterior score F is computed as 

F(v) =a[J2 Hu)u>' VlU ] + (1 - or)Y(v) 

ueN{v) 

where the parameter ae (0, 1) weights the relative 
importance of information received from neighbors, and 
uf vu = w ViU /y/d(y) x d(u) denotes the weight of edges, 
with d{v) the degree of node v. The above formula can 
be expressed in linear form F = aWF + (1 - a)Y, which 
is equivalent to 



F={I-aW'y 1 (l-a)Y (1) 

It can be proved that W is similar to a stochastic 
matrix, which has eigenvalues in [-1, 1] (according to 
the PerronFrobenius theorem). Since ae (0, 1), the 
eigenvalues of (/ - aWy 1 exists. Though the linear 
equation can be solved analytically, it is difficult to com- 
pute the inverse of a large matrix with \ V\ x \V\ dimen- 
sion, and we choose a iterative propagation method to 
solve the system. At iteration t, we compute 

F £ = aWF'' 1 + (1 -a)Y (2) 

Tuning parameters. There are two tuning parameters 
in the model, aand T. T denotes the number of itera- 
tions of propagation. We study two extreme scenarios, 
T = 1 and T = °°. When T = 1, each node only receives 
information from its direct neighbors. Posterior score F 
is calculated from equation (2). When T = °°, equation 
(2) reaches equilibrium after many iterations and the 
information is smoothed over the network, therefore 
each node also gains information from its indirect 
neighbors through iterations. F scores will converge as 
shown in equation (1). In practice, equilibrium is often 
achieved within 20 iterations. 

a, also known as the damping parameter in the litera- 
ture, denotes how much information a node receives from 
neighbors. Higher crindicates less weight on its own prior 
information. Previous applications of similar algorithms of 
ranking SNPs or genes recommend ae [0.5, 0, 95] [13,18]. 
We explored ae (0.2, 0.4, 0.6, 0.8, 0.9) in the experiments. 
A good choice of ashould give better internal consistency 
of gene rank by posterior score, which can be measured by 
Kendall Tau rank correlation [19]. 

We randomly choose half of the case and control 
samples and rank genes based on posterior scores. Low 
consistency is obtained for a e (0.2, 0.4) and it agrees 
with a previous study that a should be more than 0.5 
[18]. Highest consistency is obtained when ae (0.8, 0.9), 
thus we choose the mean a= 0.85 in the main analysis, 
unless otherwise specified. 

Identifying associated genes 

As shown in Equation (2), a gene can have a high posterior 
score under two conditions: (1) it has a high degree in the 
network and receives more information from its neighbors 
than the other low degree genes, (2) most of its neighbors 
are associated with the disease and itself is GBA. Therefore 
ranking a gene based on posterior score has some issues 
when the first condition is dominant, we may include too 
many false positives in the candidate gene set and have a 
potential power loss for genes with lower degrees in the 
network. 

Here we suggest a framework, that uses permutations 
to identify GBA genes and eliminates potential false 
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positives. The pseudocode is shown in the following. 
Input parameters include: significant threshold and 
number of permutations, K. 

example input: Sig. Thresh.=0.01, 7<C=10000; 

// GWAS prior 

prior scores based on p values of Cochran- Armitage 
trend test in GWAS data; 

propagate prior scores by equation (2); 

normalize posterior scores for all genes so that the 
sum is 1; 

record posterior score for gene, as S ! 0 ; 
// permutation 
for k in 1 to K; do 

permute case and control labels, calculate prior scores 
from p values; 

propagate prior scores by equation (2); 

normalize posterior scores for all genes so that the 
sum is 1; 

record posterior score for gene, as S' k ; 
done 

// find candidate GBA genes 
for gene, in network; do 

the p value of the posterior score of gene, is 

ELJ(sj>s{,) 

K 

gene, is candidate association if the p value of its pos- 
terior score is smaller than Sig. Thresh, 
done 

Prediction model and ROC 

The GBA candidate genes are important nodes in the 
network, because many of their neighboring genes are 
associated with disease. To measure how the GBA 
genes collectively contribute to the disease, we used 
them to build prediction models and evaluate the per- 
formance in 10-fold cross-validation (CV). The predic- 
tion models based on 90% of the cases and controls 
were tested on the remaining 10% data, and it was 
repeated 10 times with different 90% and 10% of the 
cohorts. 

A logistic regression model with all the SNPs as cov- 
ariates is fitted by the R package glmnet [20]. Though 
the GBA framework only chose the most significant 
SNP to represent the gene, we used all the SNPs 
located within 10Kb boundary of the candidate genes. 
glmnet applies cyclical coordinate descent to solve elas- 
tic-net penalized regression models, which are mix- 
tures of two penalties: l\ (the lasso) and l 2 (ridge 
regression), and it generates models with relatively few 
predictors. To evaluate the performance of the predic- 
tive model, we calculated the average Area Under the 
receiver operating characteristic Curves (AUC) [21] for 
all 10 trials. 



Results and discussion 

Problems with association by rank 

Network propagation for prioritizing associated genes 
has been applied in several studies when there is func- 
tional similarity between a given gene and the known 
disease gene. The selected few known disease associated 
genes give prior information for the network, after pro- 
pagation, each gene gets a posterior score, which repre- 
sents its association to the disease. 

Implementation of similar ideas in a GWAS showed 
boosting in identification of disease-associated genes 
[14]. However, in such an application, genes of high 
degree often have high posterior scores due to propaga- 
tion. If we simply take the genes with highest posterior 
score as candidates, we may end up including too many 
false positives in the candidate gene list. In Additional 
file 1: Table SI, where N 0 refers to network with the 
GWAS prior and N k *o to networks with randomized 
prior, one can see that the top ranked genes in N 0 are 
also often ranked top in N k ^ 0 . Although the detailed 
implementation of our study and the one by [14] is dif- 
ferent, it reveals the necessity of utilizing such network 
methods in a more cautious way. Moreover, most of the 
top ranked genes in N 0 have high network degree, with 
average degree of 195.4, while the PPI used in our stu- 
dies has an average degree of 24.5, it again confirms our 
concern that genes of high degree tend to be ranked on 
top in the network. 

Candidate genes 
Identified candidate genes 

The study by [14] listed the top-ranked 150 genes, to 
make the results comparable, we also made a list of 150 
top-ranked genes with 7=1, they are ranked by the 
p value of their posterior score obtained from permuta- 
tion. As shown in Additional file 1: Table S2, one can 
see that many genes with significant p value from 
GWAS are also called in our study. This is not surpris- 
ing since the prior scores from GWAS contribute to the 
posterior scores. Moreover, our method also identified 
some genes that are not called by standard methods in 
this data set, but reported in other independent GWAS. 
The number of genes that are validated by other studies 
and reported in GWAS catalog [22] is shown in Figure 1. 
There are 7 genes identified as association candidates in 
our study, which failed to be called by GWAS p value 
ranking, they are PTPN22, IRF1, PTGER4, IL12B, 
IL18R1, FASLG and TAK2. Except IL12B and JAK2, the 
other 5 genes also failed to be called by [14]. These can- 
didate genes all have a higher rank of posterior scores in 
the network of GWAS prior, compared to network with 
random prior, as shown in Additional file 1: Table S2. 
For example, gene TAK2 (MIM 147796) has a gene level 
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our study 

Figure 1 Number of reported genes. Overlap of validated genes 



among top 150 genes for each method, all of them map SNPs to 
genes within 1 0kb boundary. Validated genes denote genes that 
are reported as candidate in later GWAS studies. Method GWAS 
means ranking genes by minimum p value. 



p value of 0.0523, but its rank in the network increases 
from average of 104 in N kr0 to 54 in N 0 . IL18R1 (MIM 
604494) has a GWAS p value of 0.003, and its rank of 
posterior score is 3342 in N kr0 and 1109 in N 0 . This gene 
would be missed by both single marker test as well as 
ranking genes by posterior score. 

There are 4 genes, C13orf31, MST1, RTEL1 and HLA- 
DQA2, that were listed on top 150 by [14] and failed to 
be called in our study. The reason we failed to call them 
is because these genes are not present in the STRING 
database we used. Therefore a more complete PPI data- 
base integrated from diverse sources is needed, as was 
done by [14]. 

Based on the number of significant genes that are 
reported by other independent GWAS or meta-analyses, 
there seems to be no big advantages of our results com- 
pared with the one by [14]. However, these studies are 
designed to discover different signals in the network. 
Our method focus more on genes with abundant guilty 
neighbors, thus it has higher power to discover local sig- 
nals even with low degree genes but less power to call 
high degree genes. However, there are also two other 
factors in the approaches that may contribute to the dif- 
ference in performance. First, the detailed implementa- 
tion is different, [14] rank genes by posterior log odds, 
and choose a best prior log odds such that the number of 
validated genes is maximized. Second, the reported genes 
in GWAS catalog might have a bias to genes that have 
significant p value in certain type of study (eg., certain 



type of chips). Nevertheless, our study showed potential 
power of GBA framework to boost GWAS signals. 
Genomic prediction model 

We further investigated whether the candidate genes 
collectively contribute to the disease and evaluated the 
extent to which predictions were driven by these candi- 
date genes. A previous study conducted pathway analy- 
sis on the same data set [23] and built a logistic 
regression model of 277 genes with a variable selection 
algorithm. The average AUC was 0.6 in 10-fold CV with 
all the SNPs within lOkb boundary of the selected 
genes, and it dropped to 0.56 after excluding all SNPs 
with p value <5 x 10~ 7 . Using GBA candidate genes, we 
had higher average AUC in 10-fold CV, the average is 
0.705 (T = 1) and 0.730 (T=°o) in models including all 
the SNPs that are mapped to candidate genes, and 0.687 
(T = 1) and 0.715 (T=°o) after removing SNPs with 
p value <5 x 10~ 7 . The numbers are shown in Table 1, 
one can see that the increased AUC is not due to the 
number of genes we used to build the model, on the 
contrary, we used fewer genes for the prediction model, 
with 130 (T = 1) and 184 (T=°o) genes respectively. 

We also built prediction models with the same num- 
ber of genes that are ranked on top (based on posterior 
scores), which would be called candidate genes in the 
study of [14]. As shown in Table 1, there is a AUC drop 
with models of top ranked genes. The reason might be 
that the candidate genes in our study are GBA genes 
and collectively contribute to disease, while top ranked 
genes can be special in the network topology but with 
no association to the disease. 

Boosting signal from IL12 pathway genes 

Many studies of pathway analysis have uncovered signif- 
icant associations between Crohn's disease and the 
IL12/23 signaling pathway [5,24]. 19 of 20 genes in IL12 
pathway are included in our network analysis, most of 
them have a posterior score (rank) increase in N 0 , 
shown in Additional file 1: Table S3. As propagation 
redistributes the information of the network, standard 
pathway enrichment analysis based on posterior scores 
might have an advantage over the one based on direct 
GWAS results. 

Conclusions 

Combining GWAS data with function databases is very 
appealing as it provides more explanatory power for the 
list of candidate genes. While pathway methods have 
shown success in many applications, they also have lim- 
itations. For example, genes involved in multiple path- 
ways might introduce bias in different pathways, different 
definitions of the same pathway in different knowledge 
bases can affect performance assessment in terms of 
power and true positive/negative rate. [3]. 
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Table 1 Prediction models based on candidate genes. 









T = 1 












T=oo 




method 


Gene" 


SNP1 fc 


SNP2 C 


AUC1 d 


AUC2 C 


Gene 


SNP1 


SNP2 


AUC1 


AUC2 


Sig. Thresh/ 


130 


1176 


1145 


0.705(0.033) 


0.687(0.032) 


184 


1314 


1281 


0.730(0.017) 


0.715(0.013) 


PS 9 


130 


2331 


2310 


0.645(0.024) 


0.627(0.026) 


184 


3279 


3256 


0.645(0.024) 


0.626(0.026) 



"Gene is number of genes identified as GBA associations, 
b SNP1 is number of SNPs mapped to these genes. 

C SNP2 is 'SNPI excluding snps that reach GWAS significance of 5 x 1CT 7 . 

''AUCI, mean AUC of prediction models built with 'SNP1, numbers in parentheses are standard deviation in cross validations. 
"AUC2, similar to AUCI, prediction models built with C SNP2. 
f Sig. Thresh. = 0.01 

9 PS means genes with highest posterior scores are selected. 



The methods of combining GWAS and PPI networks 
mainly fall into two categories, (1) dense module search 
algorithms in search of significantly enriched subnetworks 
[11,12,25], (2) propagation algorithms related to Google's 
PageRank [13,14] in search of genes that have top ranks in 
the network. However, while methods in group (1) require 
intensive randomization of network topology for accessing 
module significance, and often encounter multiple testing 
problem in searching of modules of various sizes in high 
dimension space, methods in group (2) fail to distinguish 
signals from GWAS and signals from network, and there- 
fore tend to have a high false positive rate, especially in 
case of biased PPI database. 

Our study extends the idea of network propagation with 
GWAS information, such that information from various 
resources can be utilized. The performance of this method 
can be improved in various ways. Integration of diverse 
data sources, as suggested in [14], will improve the ability 
to prioritize disease genes. Mapping multiple SNPs to a 
single gene is the simplest way of obtaining genelevel sta- 
tistics, yet some collapsing-based and kernelbased meth- 
ods are worth trying for gene-level statistics [26] . There 
are also potential extensions of this study. For example, 
most of the 20 major genes in IL12 pathway, identified as 
associated in Crohn's Disease, have increased posterior 
score in the network of GWAS prior, it implies that net- 
work propagation method redistributes the information in 
the network where the true associations get enriched 
information. Therefore pathway analysis based on poster- 
ior scores may have more power than the standard path- 
way analysis. Many methods that detect interaction and 
epistasis, such as Support Vector Machine [27] and Logic 
Regression [28] are not applicable in genome-wide scale 
due to high dimensions, a reduced search space such as 
interactions among GBA genes might yield some results. 

Additional material 



Additional file 1: Genes with highest posterior scores Table SI lists 
genes with highest posterior scores in the network, with parameter of 
7=1. Candidate gene set of top 1 50 genes. Table S2 lists the top 1 50 
candidate GBA genes in our study, which are used for comparison with 



other methods. Genes in IL12 pathway. Table S3 lists 19 genes in IL12 
pathway, most of them have an increased posterior score in the network 
of GWAS prior. 
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